A Unified Framework Via Correction for Offline Safe Reinforcement Learning
Type: publication, Submitted to NeurIPS 2024, 2024
Offline safe reinforcement learning (safe RL) aims to learn, from previously collected datasets, an optimal policy that maximizes the expected reward while satisfying given cost constraints. Directly applying safe RL methods in the offline setting can fail due to the extrapolation error caused by out-of-distribution actions. Moreover, since the offline dataset may come from unsafe policies, the cost-aware learning process can still be guided toward unsafe trajectories generated by the behavioral policy. To jointly address extrapolation error and safety constraints, we introduce the Cost-Corrected Markov Decision Process (CC-MDP). CC-MDP corrects unsafe policy learning through reward re-distribution and a cost penalty, transforming the safe offline learning problem into a pure offline learning problem without cost constraints. We theoretically demonstrate that the CC-MDP has the same optimal value function as its corresponding CMDP in the offline setting. To validate our framework, we combine CC-MDP with common offline RL algorithms. Experiments on various offline safe RL tasks show that pure offline RL algorithms can achieve competitive rewards while satisfying constraints under our CC-MDP framework.
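The abstract does not spell out the exact correction rule, so the following is only a minimal illustrative sketch of the general idea of folding a cost penalty into the rewards of an offline dataset, after which any standard offline RL algorithm can be trained on the corrected rewards. The function name `cost_corrected_rewards` and the `penalty` and `cost_limit` parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def cost_corrected_rewards(rewards, costs, cost_limit, penalty=10.0):
    """Illustrative (not the paper's) reward correction for offline trajectories.

    rewards, costs: lists of per-step numpy arrays, one array per trajectory.
    Trajectories whose cumulative cost exceeds `cost_limit` have their rewards
    penalized in proportion to the per-step cost, steering a cost-agnostic
    offline RL algorithm away from constraint-violating behavior.
    """
    corrected = []
    for r, c in zip(rewards, costs):
        if c.sum() > cost_limit:
            corrected.append(r - penalty * c)   # penalize unsafe trajectory
        else:
            corrected.append(r.copy())          # keep safe trajectory unchanged
    return corrected

# Toy offline dataset: one safe and one unsafe trajectory
rewards = [np.array([1.0, 1.0, 1.0]), np.array([2.0, 2.0, 2.0])]
costs   = [np.array([0.0, 0.0, 0.0]), np.array([1.0, 1.0, 1.0])]
print(cost_corrected_rewards(rewards, costs, cost_limit=1.0))
```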